More aliases (for Cameroon, Sudan, Brazil, Congo) #9

Merged: 4 commits, Mar 23, 2021

Conversation

wu-lee commented Mar 19, 2021

Mainly inspired by an encounter with these in the wild:

  • Brasil
  • Cameroun
  • The Sudan
  • Democratic Republic of Congo

However, I have also added some extra permutations/variations which seem valid, having consulted Wikipedia.

@@ -461,7 +461,11 @@ CD:
- Congo-Kinshasa
- DRC
- DR Congo
- Congo, The Democratic Republic Of The
- Congo, Democratic Republic of
Owner

We should keep Congo, The Democratic Republic Of The. It's valid: https://www.iso.org/obp/ui/#iso:code:3166:CD.

A version with parentheses should be added too, but this looks like a broader task for another day.

Author

I think I meant to add the extra alias, and not delete the existing one. Will fix.

@@ -2394,6 +2407,7 @@ SR:
SS:
aliases:
- S. Sudan
- South Sudan, Republic of
Owner

sshaw commented Mar 20, 2021

Do you think we may as well add "Republic of South Sudan" and "Republic of S. Sudan", and similar entries for the others? In English, "Republic of South Sudan" is how it would be formally written.

Author

Sure. Although "S. Sudan" is a bit of a weird one, since it adds an extra permutation: abbreviations of "north", "south", etc. An "S." variation is included here and for a few other cases, but not for South Africa, South Georgia, North Macedonia, etc.

@sshaw
Owner

sshaw commented Mar 20, 2021

Cool, thanks for this. Aside from the inline comments: Brasil, République démocratique du Congo, etc. These are Portuguese and French respectively, but this is en.yml. While I would definitely like to support other languages, just having these sprinkled into the file for English normalizations would be a surprise. I can see the argument that one should have the country's name in its official language(s), but then what does that mean for the rest of the countries? Maybe that part is a work in progress?

@sshaw
Owner

sshaw commented Mar 20, 2021

> While I would definitely like to support other languages just having these sprinkled into the file for English normalizations would be a surprise. I can see the argument that one should have the country's name in its official language(s) but then what does that mean for the rest of the countries? Maybe that part is a work in progress?

There was also #5

@wu-lee
Author

wu-lee commented Mar 20, 2021

> Brasil and République démocratique du Congo, etc... These are Portuguese and French respectively but this is en.yml. While I would definitely like to support other languages just having these sprinkled into the file for English normalizations would be a surprise.

I notice that "Côte D'Ivoire" is not English, and in fact many of the names contain other languages? For example, "São Tomé and Príncipe", "Sint Maarten", "Timor-Leste". There are surprises already!

> I can see the argument that one should have the country's name in its official language(s) but then what does that mean for the rest of the countries? Maybe that part is a work in progress?

I'd probably argue that for a normalisation tool to be most effective, it needs to recognise all sorts of weirdness? So the "input" labels can't be categorised as any particular language, and in fact could be a mixture, like "Republic of Côte D'Ivoire", "Sint Maarten (Dutch Part)" etc. Outputs are another matter. The aliases seem mostly to be there for recognition, and the canonical names and identifiers are the outputs.

[edit] In my use case, I've been using this gem for normalising into an ISO code, and Carmen for generating names in various languages.

@sshaw
Owner

sshaw commented Mar 20, 2021

> I notice that "Côte D'Ivoire" is not English, and in fact many of the names contain other languages? For example, "São Tomé and Príncipe", "Sint Maarten", "Timor-Leste". There are surprises already!

The difference seems to be as discussed in #8: names are often written as they are in their native language.

"Leste" may be Portuguese, but "The Democratic Republic of" is not. "République démocratique du Congo" is entirely in French. In "Republic of Côte D'Ivoire", only "Côte D'Ivoire" is French. This is the distinction. Maybe there are exceptions: "Curaçao", perhaps, since this is Portuguese (amongst others), but in Papiamento it's "Kòrsou".

> I'd probably argue that for a normalisation tool to be most effective, it needs to recognise all sorts of weirdness?

Given this code/conversation, I think there are some ways to normalize (in general, not in this PR): St. => Saint, S. => South, etc.

> like "Republic of Côte D'Ivoire", "Sint Maarten (Dutch Part)"

To me this is English with acceptable variants (aliases) of "Cote D'Ivoire" etc...
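
The "St. => Saint, S. => South" idea mentioned above could be sketched roughly as follows. This is only an illustration, not code from the gem: the `ABBREVIATIONS` table and the `expand_abbreviations` helper are hypothetical names, and the list of abbreviations is an assumption.

```ruby
# Hypothetical abbreviation table (an assumption, not taken from the gem).
ABBREVIATIONS = {
  "St." => "Saint",
  "S."  => "South",
  "N."  => "North"
}.freeze

# Expand an abbreviation only when it appears as a whole
# whitespace-separated word, so "U.S." is left untouched.
def expand_abbreviations(name)
  name.strip.split(/\s+/).map { |word| ABBREVIATIONS.fetch(word, word) }.join(" ")
end

puts expand_abbreviations("S. Sudan")   # South Sudan
puts expand_abbreviations("St. Lucia")  # Saint Lucia
```

Expanding abbreviations before the alias lookup would mean that entries such as "S. Sudan" no longer need to be listed by hand, at the cost of a small table to maintain.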

@wu-lee
Author

wu-lee commented Mar 21, 2021

I've re-rolled this PR in the light of the above, see what you think now.

@sshaw
Owner

sshaw commented Mar 23, 2021

Great thanks. This also gave me some things to think about and some ideas for code-level normalization.

@sshaw sshaw merged commit c96f1e5 into sshaw:master Mar 23, 2021
- Congo, Democratic Republic of
- Democratic Republic of Congo
- DR Congo
- RD Congo
Owner

French! Is this intentional? 😅

Author

Mouais?

@wu-lee
Author

wu-lee commented Mar 23, 2021

Thanks, feel free to tweak things for consistency, although saying that I do need that damn Cameroun, which might be French...

Possibly helpful for you or others: I discovered that the I18n gem's #transliterate can remove accents from text, and Rails' #parameterize can further strip out punctuation. The latter is small enough to steal wholesale, the former not so much. But processing text through those gets rid of a lot of the variation.
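
For readers without those gems to hand, a rough standard-library approximation of the same idea might look like the sketch below. It does not call I18n or Rails and will not match their output in every case; `normalize_key` is a hypothetical helper name.

```ruby
# Stdlib-only approximation of "strip accents, then strip punctuation".
# This is NOT I18n's #transliterate or Rails' #parameterize, just a sketch
# of the same kind of processing.
def normalize_key(text)
  text
    .unicode_normalize(:nfd)   # split accented letters into base letter + combining mark
    .gsub(/\p{Mn}/, "")        # drop the combining marks
    .downcase
    .gsub(/[^a-z0-9]+/, "-")   # collapse punctuation and whitespace into one separator
    .gsub(/\A-|-\z/, "")       # trim a separator left at either end
end

puts normalize_key("Côte D'Ivoire")          # cote-d-ivoire
puts normalize_key("São Tomé and Príncipe")  # sao-tome-and-principe
```

Comparing inputs and aliases by such a key would make "Côte D'Ivoire" and "Cote d'Ivoire" equivalent without listing both.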

@sshaw
Owner

sshaw commented Mar 25, 2021

> Thanks, feel free to tweak things for consistency, although saying that I do need that damn Cameroun, which might be French...

You could always just use your own YAML file. This was my original idea; I just never coded it in, but doing it would be trivial. Maybe less than trivial if we just go with an env var at first.

> Possibly helpful for you or others: I discover that the I18n gem's #transliterate can remove accents from text...

Yes, I see it's fairly compact. Will consider this, thanks.
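
The "own YAML file via an env var" idea was not implemented in this thread, so the following is only a sketch of how it might work. The variable name `COUNTRY_ALIASES_FILE` and both helpers are hypothetical, and the file layout (country code, then an `aliases` list) is assumed from the diff fragments shown earlier.

```ruby
require "yaml"

# Read alias data from the file named by a (hypothetical) environment
# variable, falling back to the bundled default file.
def load_alias_data(default_path, env: ENV)
  path = env.fetch("COUNTRY_ALIASES_FILE", default_path)
  YAML.safe_load(File.read(path)) || {}
end

# Turn { "CM" => { "aliases" => ["Cameroun"] } } into { "cameroun" => "CM" }.
def build_lookup(data)
  data.each_with_object({}) do |(code, entry), lookup|
    Array(entry && entry["aliases"]).each { |name| lookup[name.downcase] = code }
  end
end
```

With something like this, a user who needs "Cameroun" could point the variable at a private file rather than adding non-English names to en.yml.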
